# Vision-Text Alignment
## Ovis2-1B-dev
Isotr0py · Apache-2.0 · Text-to-Image · Transformers · Multilingual
Ovis2-1B is the latest member of the Ovis series of multimodal large language models (MLLMs). It focuses on structural alignment of vision and text embeddings, and it pairs strong performance at small model scale with enhanced reasoning, video and multi-image processing, and improved multilingual OCR.
Downloads: 79 · Likes: 1
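As a loading sketch only: the snippet below assumes the repo id `Isotr0py/Ovis2-1B-dev` for this card and relies on `trust_remote_code`, since Ovis models ship their own multimodal classes; the exact preprocessing and generation interface is defined by the model's custom code and is not shown here.

```python
# Minimal sketch: loading an Ovis2-style MLLM via Transformers.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Isotr0py/Ovis2-1B-dev",     # assumed repo id for this card
    torch_dtype=torch.bfloat16,  # half precision keeps the 1B model light
    trust_remote_code=True,      # Ovis defines custom multimodal classes
)
model.eval()
```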
## siglip-so400m-14-980-flash-attn2-navit
HuggingFaceM4 · Apache-2.0 · Text-to-Image · Transformers
A SigLIP-based vision model that raises the maximum input resolution to 980x980 through interpolated positional embeddings and implements the NaViT strategy for variable-resolution, aspect-ratio-preserving image processing.
Downloads: 4,153 · Likes: 46
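The resolution bump this card describes comes from resizing the learned position grid. Below is a minimal sketch of that interpolation; the grid sizes and embedding width are illustrative assumptions, and SigLIP-style encoders have no class token, so the whole sequence is treated as a grid.

```python
# Sketch of positional-embedding interpolation: resizing a ViT's learned
# position grid so a model pretrained at a lower resolution accepts 980x980.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor,
                          old_grid: int, new_grid: int) -> torch.Tensor:
    """Resize (1, old_grid**2, dim) position embeddings to (1, new_grid**2, dim)."""
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, H, W) so 2D bicubic interpolation applies
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# Illustrative: patch size 14, 27x27 grid (~384px input) -> 70x70 grid (980px)
pos = torch.randn(1, 27 * 27, 1152)
print(interpolate_pos_embed(pos, 27, 70).shape)  # torch.Size([1, 4900, 1152])
```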
## chinese-clip-vit-large-patch14
OFA-Sys · Image Classification · Transformers
A Chinese CLIP model based on the ViT architecture, supporting Chinese vision-language tasks.
Downloads: 2,333 · Likes: 32
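A usage sketch for this card with the ChineseCLIPModel and ChineseCLIPProcessor classes from Transformers; the image path and candidate captions are placeholders.

```python
# Image-text matching with Chinese-CLIP; image and captions are placeholders.
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

repo = "OFA-Sys/chinese-clip-vit-large-patch14"
model = ChineseCLIPModel.from_pretrained(repo)
processor = ChineseCLIPProcessor.from_pretrained(repo)

image = Image.open("example.jpg")  # placeholder image
texts = ["一只猫", "一只狗"]  # "a cat", "a dog"

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(probs)  # match probability of the image against each caption
```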
## vit_base_patch16_clip_224.openai
timm · Apache-2.0 · Text-to-Image · Transformers
CLIP is a vision-language model developed by OpenAI that trains paired image and text encoders with a contrastive objective, which enables zero-shot image classification.
Downloads: 618.17k · Likes: 7
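timm ships only the image tower of this checkpoint, so the sketch below covers embedding extraction; full zero-shot classification would pair it with CLIP's text encoder (e.g. via open_clip). The dummy tensor stands in for a preprocessed image batch.

```python
# Extracting image embeddings from the timm CLIP image tower above.
import timm
import torch

model = timm.create_model("vit_base_patch16_clip_224.openai",
                          pretrained=True, num_classes=0)  # 0 -> pooled features
model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy stand-in for a preprocessed image
with torch.no_grad():
    emb = model(x)
print(emb.shape)  # (1, 768) for the ViT-B/16 tower
```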
## clip-vit-base-patch16
openai · Image-to-Text
CLIP is a multimodal model developed by OpenAI that maps images and text into a shared embedding space through contrastive learning, enabling zero-shot image classification.
Downloads: 4.6M · Likes: 119
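A minimal zero-shot classification sketch with this checkpoint via Transformers; the image path and label prompts are placeholders.

```python
# Zero-shot image classification with CLIP; image and labels are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

repo = "openai/clip-vit-base-patch16"
model = CLIPModel.from_pretrained(repo)
processor = CLIPProcessor.from_pretrained(repo)

image = Image.open("example.jpg")  # placeholder image
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```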